Abstract

Coalescent histories are combinatorial structures that describe for a given gene tree and species tree the 
possible lists of branches of the species tree on which the gene tree coalescences take place. Properties of the 
number of coalescent histories for gene trees and species trees affect a variety of probabilistic calculations in 
mathematical phylogenetics. Exact and asymptotic evaluations of the number of coalescent histories, however, 
are known only in a limited number of cases. Here we introduce a particular family of species trees, the 
lodgepole species trees $(\lambda_n)_{n \geq 0}$, in which tree $\lambda_n$ has $m = 2n+1$ taxa. We determine the number of coalescent
histories for the lodgepole species trees, in the case that the gene tree matches the species tree, showing that this
number grows with $m!!$ in the number of taxa $m$. This computation demonstrates the existence of tree families
in which the growth in the number of coalescent histories is faster than exponential. Further, it provides a
substantial improvement on the lower bound for the ratio of the largest number of matching coalescent histories
to the smallest number of matching coalescent histories for trees with $m$ taxa, increasing a previous bound
of $(\sqrt{\pi}/32)[(5m-12)/(4m-6)]\,m\sqrt{m}$ to $[\sqrt{m-1}/(4\sqrt{e})]^{m}$. We discuss the implications of our enumerative
results for phylogenetic computations. 

Key words: coalescence, genealogy, phylogeny. 


1 Introduction 


Advances in the mathematical investigation of gene genealogies and the increasing availability of genetic data from
diverse taxa have clarified that species trees, representing the branching histories of populations of organisms,
need not be reflected in gene trees that represent the histories of individual genomic regions (Pamilo and Nei,
1988; Maddison, 1997; Nichols, 2001). New developments concerning the relationship between gene trees and
species trees have now led to new methods for species tree inference, new approaches to inferences about
evolutionary phenomena from gene tree discordance, and an improved understanding of the branching descent process
(Degnan and Rosenberg, 2009; Liu et al., 2009; Knowles and Kubatko, 2010).

Investigations of the evolution of genomic regions along the branches of species trees have also generated
new combinatorial structures that can assist in studying gene trees and species trees (Maddison, 1997;
Degnan and Salter, 2005; Than and Nakhleh, 2009; Wu, 2012). Among these structures are coalescent histories,
structures that for a given gene tree topology and species tree topology represent possible pairings of coalescences
in the gene tree with branches of the species tree on which the coalescences take place (Degnan and Salter, 2005;
Rosenberg, 2007).


Coalescent histories are important in a number of types of studies of the relationship between gene trees and
species trees. They have appeared in empirical investigations of the gene tree topologies likely to be produced
along the branches of a given species tree (Rosenberg and Tao, 2008). They are a component of mathematical
proofs that concern properties of evolutionary models of gene trees conditional on species trees (Allman et al.,
2011; Than and Rosenberg, 2011). Coalescent histories also arise in studying state spaces for models that consider
transitions along the genome among the gene genealogies represented at specific sites (Hobolth et al., 2007, 2011;
Dutheil et al., 2009).

Figure 1: Natural logarithm of the number of coalescent histories for all matching gene trees and species trees with at
most 9 taxa. The values plotted are taken from Tables 1-4 of Rosenberg (2007). Each dot corresponds to a tree of the
specified size. The line represents a linear regression $y = a + bx$, with $a \approx -2.91891$ and $b \approx 1.07865$.


Many coalescent histories might be possible for a given gene tree and species tree, and the number of possible
coalescent histories is a key quantity in the study of gene trees and species trees. In particular, because the
probability of a gene tree topology conditional on a species tree can be written as a sum over coalescent histories
(Degnan and Salter, 2005), the time required for computing gene tree probabilities is proportional to the number
of coalescent histories compatible with a given gene tree and species tree. Thus, to study computational aspects
of the use of coalescent histories, it has been of interest to evaluate the number of coalescent histories permissible
for a given pair consisting of a gene tree and a species tree.

Degnan and Salter (2005), who initiated the study of coalescent histories, reported that if the labeled gene
tree topology and species tree topology have the same matching “caterpillar” shape with $m$ taxa, then the number
of coalescent histories is the Catalan number

$$c_{m-1} = \frac{1}{m}\binom{2m-2}{m-1}. \qquad (1)$$

The Catalan sequence $c_m$ is asymptotic to $4^m/(m^{3/2}\sqrt{\pi})$. Rosenberg (2007) and Than et al. (2007) provided
recursive procedures that list all possible coalescent histories given a gene tree and species tree, and Rosenberg
(2007) offered simple recursive formulas for counting them. Rosenberg (2007, 2013) and Rosenberg and Degnan
(2010) then solved the recursion in a number of specific cases.

What is the asymptotic behavior of the number of coalescent histories as the number of taxa increases? In
Figure 1, we show values taken from Rosenberg (2007) for the number of coalescent histories for matching gene
trees, for all species trees with $m \leq 9$ taxa. On a logarithmic scale, a linear model fits the values quite well,
suggesting that in general, the number of coalescent histories for matching gene trees and species trees might
grow exponentially in the number of taxa. Existing enumerations of coalescent histories in particular cases, both
for the caterpillar trees in eq. 1 and in related caterpillar-like families (Rosenberg, 2013), support this prediction.
We might therefore expect that for a generic family of species trees of increasing size, the number of coalescent
histories for the matching gene tree increases exponentially.

Figure 2: Coalescent histories with matching gene trees and species trees. (A) A matching coalescent history. (B)
Condition (a) for matching coalescent histories is violated because leaf B descends from node $k$ but not from branch $h(k)$.
(C) Condition (b) for matching coalescent histories is violated, as node $k_2$ descends from node $k_1$ but the branch $h(k_2)$
remains strictly above the branch $h(k_1)$.


Here, we show that this prediction does not always hold. Indeed, we exhibit a family of species trees
$(\lambda_n)_{n \geq 0}$, which we term the lodgepole family, whose number of coalescent histories grows with the double
factorials, and thus increases at a rate that is faster than exponential in the number of taxa. We use the lodgepole
family to further understand the variability at a given $m$ of the number of coalescent histories for cases with matching
gene trees and species trees. Rosenberg (2007) obtained a lower bound on the ratio of the largest number of
coalescent histories to the smallest number of coalescent histories at $m$ taxa, showing that this ratio was greater
than a constant multiple of $(\sqrt{\pi}/32)[(5m-12)/(4m-6)]\,m\sqrt{m}$. Here we improve substantially upon this lower
bound, demonstrating that the ratio exceeds the much larger $[\sqrt{m-1}/(4\sqrt{e})]^{m}$.


2 Preliminaries 

2.1 Species trees and coalescent histories 

A species tree is a binary rooted tree equipped with a labeling for the leaves. As in other studies of coalescent 
histories, a single labeling can without loss of generality be taken as representative of an unlabeled species tree 
topology. When the labeling is not needed, we abbreviate the arbitrarily labeled species tree by its unlabeled 
shape and consider the labeled and unlabeled topologies interchangeably. We consider matching gene trees and 
species trees with the same labeled topology t. 

We term a coalescent history for the case when the gene tree and species tree have the same labeled topology
a matching coalescent history. Given a species tree $t$, a mapping $h$ from the internal nodes of $t$ to the branches of
$t$ is a matching coalescent history of $t$ when it satisfies both of the following two conditions: (a) for all leaves $x$
in $t$, if $x$ descends from internal node $k$ in $t$, then $x$ descends from branch $h(k)$ in $t$; (b) for all internal nodes $k_1$
and $k_2$ in $t$, if $k_2$ is a descendant of $k_1$ in $t$, then branch $h(k_2)$ is descended from or coincides with branch $h(k_1)$
in $t$. Figure 2A shows an example of a matching coalescent history. The examples in Figure 2B and 2C are not
matching coalescent histories; in Figure 2B, condition (a) is violated, and in Figure 2C, condition (b) is violated.
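Conditions (a) and (b) can be checked by brute force on small trees. The sketch below is only an illustration and not the tabulation procedure of Rosenberg (2007): it assumes trees written as nested Python tuples with string leaves, and the helper names `caterpillar`, `lodgepole`, and `count_matching_histories` are hypothetical. It uses the observation that condition (a) forces $h(k)$ to be the branch directly above $k$ or above one of its ancestors, so each choice can be indexed by the depth of the node below that branch, and condition (b) then says that this depth cannot decrease from a node to its internal children.

```python
from functools import lru_cache


def caterpillar(m):
    """Caterpillar tree with m >= 2 leaves, as nested tuples (assumed encoding)."""
    tree = ("t1", "t2")
    for i in range(3, m + 1):
        tree = (tree, f"t{i}")
    return tree


def lodgepole(n):
    """Lodgepole tree: start from one leaf, then join a cherry n times."""
    tree = "x0"
    for i in range(1, n + 1):
        tree = (tree, (f"a{i}", f"b{i}"))
    return tree


def count_matching_histories(tree):
    """Count maps h satisfying conditions (a) and (b) by direct recursion."""

    @lru_cache(maxsize=None)
    def count(node, depth, lowest):
        # Leaves carry no coalescence, so they contribute a factor of 1.
        if isinstance(node, str):
            return 1
        total = 0
        # `target` is the depth of the node directly below branch h(node):
        # target == depth is the branch just above the node itself,
        # target == 0 is the branch above the root.
        for target in range(lowest, depth + 1):
            ways = 1
            for child in node:
                # Condition (b): a child may not map above its parent's image.
                ways *= count(child, depth + 1, target)
            total += ways
        return total

    return count(tree, 0, 0)


if __name__ == "__main__":
    print([count_matching_histories(caterpillar(m)) for m in range(2, 7)])
    print([count_matching_histories(lodgepole(n)) for n in range(0, 5)])
```

For caterpillars the counts agree with the Catalan numbers of eq. 1, and for the lodgepole trees of Section 2.2 they agree with the exact values listed in Table 1.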

2.2 The lodgepole family of species trees 

We focus here on the number of matching coalescent histories (histories or coalescent histories for short) for a
particular family of species trees, $(\lambda_n)_{n \geq 0}$, that we call the lodgepole family. We define $\lambda_0$ as the 1-taxon tree.
For $n \geq 0$, we inductively define $\lambda_{n+1}$ by appending $\lambda_n$ and a tree with two leaves (a cherry) to a common root
(Fig. 3). Note that the tree $\lambda_n$ has $m = 2n + 1$ rather than $n$ taxa; we use $n$ to denote the $n$th tree $\lambda_n$ of the
lodgepole family and perform our enumerations according to this parameter, later returning to $m$, the number
of taxa. We view $\lambda_n$ as unlabeled, or as having an arbitrary labeling.

The lodgepole family $(\lambda_n)_{n \geq 0}$ can be seen as a modification of the caterpillar family of species trees, in which a
family of trees is generated by sequentially appending the previous tree in the family and a single taxon (instead
of a cherry, as in the lodgepole family) to a common root.

Figure 3: The lodgepole family of species trees $\lambda_n$. Starting from the tree with one taxon ($\lambda_0$), by adding $n > 0$ cherries,
we obtain the tree $\lambda_n$. The term lodgepole is after the lodgepole pine tree, Pinus contorta, one of a number of pine species
in which needles extend from the main twig in bundles of two.


2.3 Dyck paths 

To enumerate histories for lodgepole species trees, we make use of results that involve certain lattice paths, the
Dyck paths (Stanley, 1999). A Dyck path of size $n$ is a lattice path that starts from $(0,0)$ and ends at $(2n,0)$
in the quarter plane, that has $n$ unit steps up (each labeled $U$) and $n$ unit steps down (labeled $D$), and that
never passes below the x-axis (Fig. 4A). It is useful to distinguish the indecomposable Dyck paths from the
decomposable ones. A Dyck path of size $n$ is said to be indecomposable when it touches the x-axis only at the
extreme points $(0,0)$ and $(2n,0)$. A Dyck path is decomposable if it is not indecomposable. In Figure 4B, the
two Dyck paths at the top are indecomposable, and the remaining three are decomposable.


3 The number of matching coalescent histories for lodgepole species trees 

3.1 Overview 

We are now ready to compute the number $h_n$ of matching coalescent histories for the lodgepole tree $\lambda_n$. We start
in Section 3.2 by obtaining a combinatorial formula that computes $h_n$ as a sum over a certain set of vectors $V_n$.
In Section 3.3, we show that by a bijection of coalescent histories for $\lambda_n$ with a certain set of Dyck paths $D_n$ (a
set that is in turn related to structures known as indecomposable histoires d'Hermite), we can apply existing
enumerative results to obtain a recursion for $h_n$. Finally, in Section 3.4, we study the asymptotic behavior of $h_n$.

3.2 A first combinatorial formula for $h_n$

For $n \geq 1$, we define a set $V_n$ of integer vectors $a = (a_1, a_2, \ldots, a_n)$ as

$$V_n = \{a : a_1 = 2 \text{ and } 2 \leq a_i \leq a_{i-1} + 1 \text{ for } 2 \leq i \leq n\}.$$

Setting, for instance, $n = 3$, we obtain $V_3 = \{(2,2,2), (2,2,3), (2,3,2), (2,3,3), (2,3,4)\}$.

We have the following combinatorial formula to compute, for $n \geq 1$, the number of matching coalescent
histories $h_n$ for the lodgepole species tree $\lambda_n$:

$$h_n = \sum_{a \in V_n} \prod_{i=1}^{n} a_i. \qquad (2)$$
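Eq. 2 can be evaluated directly for small $n$ by listing $V_n$. The following is a minimal sketch under that reading of the definition; the function names `vectors` and `h_from_vectors` are illustrative and do not come from the paper.

```python
from math import prod


def vectors(n):
    """Yield the vectors of V_n (n >= 1): a_1 = 2 and 2 <= a_i <= a_{i-1} + 1."""

    def extend(prefix):
        if len(prefix) == n:
            yield tuple(prefix)
            return
        # The next entry ranges over 2, ..., (last entry) + 1.
        for nxt in range(2, prefix[-1] + 2):
            yield from extend(prefix + [nxt])

    if n >= 1:
        yield from extend([2])


def h_from_vectors(n):
    """Evaluate eq. 2: the sum over V_n of the product of the entries."""
    return sum(prod(a) for a in vectors(n))


if __name__ == "__main__":
    print(list(vectors(3)))
    print([h_from_vectors(n) for n in range(1, 6)])
```

Because $|V_n|$ itself grows like a Catalan number, this enumeration is only practical for small $n$.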


Eq. 2 can be justified by formulating the procedure of Rosenberg (2007) for tabulating coalescent histories
specifically in the lodgepole case, observing that a history of $\lambda_n$ can be constructed in two steps. In the tree
$\lambda_n$, it is convenient to distinguish a main branch, that is, the one from which the $n$ cherry nodes in $\lambda_n$ descend
(Fig. 5A). The main branch of $\lambda_n$ thus contains $n$ internal nodes that we treat as ordered from the root (the
first node) toward the single leaf at the end of the branch. In step (a), we fix a history for the nodes of the
main branch, ignoring the attached cherries. In Figure 5A, this history is represented by the solid arcs: each
arc maps a node of the main branch onto a permissible branch. In step (b), we choose a mapping for the cherry
nodes (dashed arcs in the figure). The choice for the mapping of the cherry nodes must be compatible with the
mapping in step (a) for the nodes of the main branch of $\lambda_n$. As required by the definition of coalescent histories
in Section 2.1, the image of a cherry node $k$ cannot be placed on a branch above the one chosen in step (a) as
the image of the node of the main branch to which node $k$ is appended.

Figure 4: Dyck paths. (A) The Dyck path of size 4 whose sequence of steps is UUDUDDUD. It contains 4 unit up-steps
U and 4 unit down-steps D that never pass strictly below the x-axis. (B) The five possible Dyck paths of size 3. The two
at the top are indecomposable because they touch the x-axis only at the endpoints $(0,0)$ and $(6,0)$.

The two-step procedure translates into eq. 2. In fact, each possible history of the main branch of $\lambda_n$ can
be bijectively encoded by a vector of integers $(a_1, \ldots, a_n) \in V_n$ by noting that the $i$th node of the main branch
is mapped exactly $a_i - 2$ nodes above it, associating each node with its immediate ancestral branch (Fig. 5A).
Once the vector has been fixed, the cherry node appended to the $i$th node of the main branch can be mapped in
exactly $a_i$ compatible ways. Therefore, with the sum in eq. 2, we are considering all the possible histories of the
main branch (those constructed in step (a)), and for each of these histories, the product counts the number of
compatible mappings of the cherry nodes as considered in step (b).

By applying eq. 2, setting $h_0 = 1$ for convenience, we computed the first terms of the sequence $h_n$ (Table 1).
The values for $n = 1, 2, 3, 4$ accord with the values computed in the enumerations of coalescent histories reported
for small trees in Tables 1 and 4 of Rosenberg (2007).


3.3 Correspondence with the histoires d’Hermite and a recursion for $h_n$

We now show that a bijective correspondence exists between histories of the lodgepole family $(\lambda_n)_{n \geq 0}$ and certain
labeled paths in the plane. Indeed, note that as in the example in Figure 5B, each vector $a \in V_n$ bijectively
encodes a Dyck path of size $n \geq 1$. More precisely, starting from the vector $a$, a Dyck path with $n$ up-steps is
uniquely determined by fixing for each $i$ the ordinate $y_i$ of the endpoint of the $i$th up-step according to

$$a_i = y_i + 1. \qquad (3)$$

For instance, in Figure 5B, we depict the Dyck path UUDUDDUD associated with the vector $a = (2,3,3,2) \in V_4$:
in fact, as in eq. 3, we have $y_1 = 1 = a_1 - 1$, $y_2 = 2 = a_2 - 1$, $y_3 = 2 = a_3 - 1$, and $y_4 = 1 = a_4 - 1$.

Defining $D_n$ as the set of Dyck paths of size $n$ that for each $i$ have the $i$th up-step $U_i$ labeled by an integer

$$\ell(U_i) \in [1, y_i + 1], \qquad (4)$$

we can also interpret eq. 2 as the formula that computes the cardinality $|D_n|$, so that

$$h_n = |D_n|. \qquad (5)$$

Figure 5: Combinatorial structures for computation of $h_n$: coalescent histories, labeled Dyck paths, and histoires d’Hermite.
(A) Coalescent histories of $\lambda_4$. Arcs represent the mapping of the nodes of $\lambda_4$ to its branches. Each history of $\lambda_n$ can
be constructed in two steps. First, a mapping of the nodes of the main branch to branches of the tree is fixed. Next, a
compatible mapping of the cherry nodes is constructed. For the nodes of the main branch, the mapping in the figure is
encoded by the vector $a = (a_1, a_2, a_3, a_4) = (2,3,3,2) \in V_4$: for $i = 1, 2, 3, 4$, the $i$th node of the main branch is mapped
onto the branch $a_i - 2$ nodes above it (solid arcs). Dashed arcs represent possible mappings of the cherry nodes that are
compatible with the mapping for the main branch determined by the vector $a$. The $i$th cherry node can be mapped in
exactly $a_i$ compatible ways. (B) A labeled Dyck path of size 4, encoding the vector $(a_1, a_2, a_3, a_4) = (2,3,3,2) \in V_4$ from
(A). The ordinate $y_i$ of the endpoint of the $i$th up-step $U_i$ satisfies $y_i = a_i - 1$. By labeling each up-step $U_i$ of the path with
an integer $\ell(U_i) \in [1, y_i + 1]$, we obtain a path in $D_4$. The number of ways that the underlying Dyck path can be labeled
is thus given by the product $\prod_{i=1}^{4} a_i = 36$. (C) The indecomposable histoire d’Hermite of size 5 associated with the Dyck
path of size 4 in (B). The histoire is obtained by adding an up-step labeled 1 at the beginning of the path in (B) and a
down-step at the end and keeping the labels of the remaining up-steps as in (B). In this way, the $i$th up-step $U_i$ of the
histoire in (C) has an integer label $\ell^*(U_i) \in [1, y_i]$, where, as in (B), $y_i$ is the ordinate of the endpoint of the $i$th up-step.
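The encoding of eq. 3 can be made concrete with a short sketch that turns a vector of $V_n$ into its Dyck word. The function name `vector_to_dyck_word` is hypothetical, and the construction assumes only what eq. 3 states: the $i$th up-step must end at height $a_i - 1$, so it must start at height $a_i - 2$, which fixes the number of down-steps inserted before it.

```python
from math import prod


def vector_to_dyck_word(a):
    """Return the Dyck word whose i-th up-step ends at height a[i] - 1 (eq. 3)."""
    word, height = [], 0
    for a_i in a:
        start = a_i - 2  # height from which the i-th up-step departs
        if not 0 <= start <= height:
            raise ValueError("vector is not in V_n")
        # Descend to the starting height, then take the up-step.
        word.append("D" * (height - start) + "U")
        height = start + 1
    word.append("D" * height)  # final descent back to the x-axis
    return "".join(word)


if __name__ == "__main__":
    a = (2, 3, 3, 2)
    print(vector_to_dyck_word(a))  # the path shown in Figure 5B
    print(prod(a))                 # number of labelings allowed by eq. 4
```

For the vector of Figure 5, the word is UUDUDDUD and the number of labelings is $2 \cdot 3 \cdot 3 \cdot 2 = 36$, as stated in the caption.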


Eq. 5 holds for all $n \geq 0$, including the case $n = 0$, as we set $h_0 = 1$, and by counting the empty path, $|D_0| = 1$. In
interpreting eq. 2 as an enumeration of labeled Dyck paths in $D_n$, the sum in eq. 2 traverses all possible Dyck
paths of size $n$ as encoded by vectors in $V_n$. For each of these Dyck paths, the product $\prod_{i=1}^{n} a_i$ computes the
number of ways that the path can be labeled. By eqs. 3 and 4, each label $\ell(U_i)$ has $a_i$ possible values.

This result, similar to a bijection with monotonic paths used by Degnan (2005) to count coalescent histories
in the caterpillar case, allows us to switch from counting the histories of $\lambda_n$ to counting labeled paths in $D_n$.
The correspondence in eq. 5 can be used to obtain a recursion for $h_n$. Starting with $h_0 = 1$, we have for $n \geq 1$,

$$h_n = (2n+1)!! - \sum_{k=0}^{n-1} (2k+1)!!\, h_{n-1-k}. \qquad (6)$$
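The recursion in eq. 6 is straightforward to iterate. The sketch below is one possible way to do so with exact integer arithmetic; the names `odd_double_factorial` and `lodgepole_counts` are illustrative choices.

```python
def odd_double_factorial(n):
    """Return (2n+1)!! = 1 * 3 * 5 * ... * (2n+1)."""
    result = 1
    for j in range(1, 2 * n + 2, 2):
        result *= j
    return result


def lodgepole_counts(n_max):
    """Return [h_0, ..., h_{n_max}] from eq. 6, starting with h_0 = 1."""
    h = [1]
    for n in range(1, n_max + 1):
        correction = sum(odd_double_factorial(k) * h[n - 1 - k] for k in range(n))
        h.append(odd_double_factorial(n) - correction)
    return h


if __name__ == "__main__":
    for n, value in enumerate(lodgepole_counts(9)):
        print(f"n = {n}, m = {2 * n + 1}, h_n = {value}")
```

The values for $n = 1, \ldots, 9$ are those in the exact-value column of Table 1.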

To prove eq. 6, we make use of the relationship between the labeled Dyck paths in $D_n$ and the so-called histoires
d’Hermite of size $n + 1$ (histoires for short, Fig. 5C).

An histoire of size $n \geq 1$ is a labeled Dyck path of size $n$, but with a labeling scheme $\ell^*(U_i)$ for its up-steps
that slightly differs from the scheme $\ell(U_i)$ considered in eq. 4 for the paths of $D_n$. Indeed, in an histoire, for
each $i$, the $i$th up-step $U_i$ carries an integer label

$$\ell^*(U_i) \in [1, y_i], \qquad (7)$$

where $y_i$ is, as before, the ordinate of the endpoint of step $U_i$. Note that for histoires of size $n$, we have $y_i$ possible
values for each label $\ell^*(U_i)$, whereas for Dyck paths in $D_n$, we had $y_i + 1$ possibilities for label $\ell(U_i)$.

Denote by $H_n$ the set of histoires of size $n$. Roblet and Viennot (1996, Section 1.2) found that for $n \geq 1$,

$$|H_n| = (2n-1)!! = (2n-1) \times (2n-3) \times \cdots \times 3 \times 1. \qquad (8)$$
Table 1: The number of matching coalescent histories.

Number of taxa m   Predicted by the    Exact value $h_{(m-1)/2}$   Upper bound for $h_m^-$     Lower bound for $h_m^+$
(m = 2n + 1)       linear regression   for the lodgepole tree      based on “bicaterpillars”   based on lodgepole trees
                   model
 3                          1                       2                         2                           2
 5                         12                      10                        10                          14
 7                        103                      74                        65                         138
 9                        888                     706                       481                       1,663
11                      7,679                   8,162                     5,544                       6,237
13                     66,406                 110,410                    56,628                      90,090
15                    574,261               1,708,394                   613,470                   1,447,875
17                  4,966,073              29,752,066                 6,952,660                  25,844,568
19                 42,945,396             576,037,442                81,662,152                 509,233,725

Given $m$, we exponentiate the value from the regression model in Figure 1 and round to the nearest integer. The
exact $h_{(m-1)/2} = h_n$ is computed from eq. 2 or 6. For $m \leq 9$, the upper bound for $h_m^-$ and the lower bound for
$h_m^+$ are computed exactly from Tables 1-4 of Rosenberg (2007). For $m \geq 11$, the upper bound for $h_m^-$ is computed
as $c_n c_{n+1}$, with $c_n$ as in eq. 1. Applying Proposition 1 and noting that $(n-2)/n = (m-5)/(m-1)$, the lower
bound for $h_m^+$ is computed as $m!!\,(m-5)/(m-1)$, rounding down where necessary.
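The two bound columns of Table 1 for $m \geq 11$ follow from closed formulas given in the table note and can be recomputed with a few lines. This sketch assumes odd $m = 2n + 1$ and uses hypothetical helper names; the regression column is not reproduced here because the coefficients in Figure 1 are quoted only to a few digits.

```python
from math import comb


def catalan(n):
    """Catalan number c_n, consistent with eq. 1."""
    return comb(2 * n, n) // (n + 1)


def double_factorial(m):
    """Return m!! for odd m."""
    result = 1
    for j in range(1, m + 1, 2):
        result *= j
    return result


def bicaterpillar_upper_bound(m):
    """Upper bound c_n * c_{n+1} for the smallest count, with m = 2n + 1."""
    n = (m - 1) // 2
    return catalan(n) * catalan(n + 1)


def lodgepole_lower_bound(m):
    """Lower bound m!! (m - 5)/(m - 1) for the largest count, rounded down."""
    return double_factorial(m) * (m - 5) // (m - 1)


if __name__ == "__main__":
    for m in range(11, 21, 2):
        print(m, bicaterpillar_upper_bound(m), lodgepole_lower_bound(m))
```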


We say that an histoire of size $n$ is indecomposable if its underlying Dyck path is indecomposable (Fig. 5C).
Denoting by $H'_n$ the set of indecomposable histoires of size $n$, we have for $n \geq 0$

$$|D_n| = |H'_{n+1}|. \qquad (9)$$

Indeed, as depicted in Figure 5, panels B and C, each labeled Dyck path $P \in D_n$ can be bijectively mapped onto
a labeled path $P' \in H'_{n+1}$ that is obtained by adding an up-step $U$ labeled with the integer 1 at the beginning of $P$
and a down-step $D$ at the end of $P$, keeping unchanged the labels of the remaining up-steps
of $P$. In symbols, we have for $n \geq 0$ the bijective correspondence

$$P \in D_n \;\longmapsto\; UPD = P' \in H'_{n+1}. \qquad (10)$$

Note in fact that according to the labeling scheme $\ell^*$ for histoires (eq. 7), in $P'$ only the label $\ell^*(U_1) = 1$ is
possible for the new first up-step $U_1$, whose endpoint has ordinate 1. Furthermore, for all $i$ with $1 \leq i \leq n$, the
ordinate $y'_{i+1}$ of the endpoint of the $(i+1)$th up-step in $P'$ satisfies

$$y'_{i+1} = y_i + 1,$$

where $y_i$ is the ordinate of the endpoint of the $i$th up-step in $P$. Therefore, keeping the labels of the up-steps of $P$ unchanged,
the labeling scheme $\ell^*$ for the histoires is satisfied by $P'$, as can be seen by comparing eq. 4 and eq. 7. Finally,
by construction, the path $P'$ touches the x-axis only at the extreme points and is by definition indecomposable.
By the bijection in eq. 10, we thus have eq. 9.

Combining eqs. 5 and 9, we obtain for $n \geq 0$

$$h_n = |H'_{n+1}|. \qquad (11)$$
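Eqs. 8, 9, and 11 can be checked numerically for small sizes by enumerating Dyck paths and weighting each path by its number of admissible labelings. The sketch below is a brute-force illustration with hypothetical function names; it records, for each path, the heights $y_i$ reached by the up-steps and the number of returns to the x-axis, so that a path of positive size is indecomposable exactly when it returns once.

```python
from math import prod


def dyck_paths(n):
    """Yield (up_heights, returns) for every Dyck path of size n."""

    def walk(ups, height, heights, returns):
        if ups == n and height == 0:
            yield heights, returns
            return
        if ups < n:  # take an up-step and record the height it reaches
            yield from walk(ups + 1, height + 1, heights + (height + 1,), returns)
        if height > 0:  # take a down-step; count it if it lands on the x-axis
            yield from walk(ups, height - 1, heights, returns + (height == 1))

    yield from walk(0, 0, (), 0)


def count_histoires(n):
    """|H_n|: each up-step ending at height y carries one of y labels (eq. 7)."""
    return sum(prod(ys) for ys, _ in dyck_paths(n))


def count_indecomposable_histoires(n):
    """|H'_n| for n >= 1: histoires whose path returns to the axis only once."""
    return sum(prod(ys) for ys, returns in dyck_paths(n) if returns == 1)


def count_labeled_paths(n):
    """|D_n|: each up-step ending at height y carries one of y + 1 labels (eq. 4)."""
    return sum(prod(y + 1 for y in ys) for ys, _ in dyck_paths(n))


if __name__ == "__main__":
    print([count_histoires(n) for n in range(1, 6)])
    print([count_labeled_paths(n) for n in range(0, 5)])
    print([count_indecomposable_histoires(n + 1) for n in range(0, 5)])
```

The first list reproduces the double factorials of eq. 8, and the last two lists coincide, as eq. 9 requires.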

We denote by $H''_n$ the set of decomposable (not indecomposable) histoires of size $n \geq 1$. An histoire in $H''_{n+1}$ can
be decomposed uniquely as a concatenation of an indecomposable histoire in $H'_{n+1-k}$ for some $k$, $1 \leq k \leq n$, and
a second histoire that is either decomposable or indecomposable and hence lies in $H_k$ (Fig. 6). The endpoint
of the indecomposable histoire in $H'_{n+1-k}$ provides the first return of the decomposable histoire in $H''_{n+1}$ to the
x-axis, after which the histoire in $H_k$ might or might not touch the x-axis at a point in its interior.

Figure 6: A decomposition of a decomposable histoire d’Hermite. Any decomposable histoire in $H''_{n+1}$ for $n \geq 1$ is uniquely
obtained by concatenating an indecomposable histoire in $H'_{n+1-k}$ with $1 \leq n+1-k \leq n$ (Fig. 5C) and an histoire in $H_k$
with $1 \leq k \leq n$. The point at which they touch corresponds to the first return to the x-axis of the entire path. The shading
indicates that the indecomposable histoire on the left begins with an up-step, ends with a down-step, and does not reach
the x-axis within the shaded trapezoid.

Applying the decomposition, we have for $n \geq 1$

$$|H''_{n+1}| = \sum_{k=1}^{n} |H'_{n+1-k}|\,|H_k|.$$

Because the number of histoires in $H_n$ is known from eq. 8 and because each histoire is either decomposable or
indecomposable, we obtain a recursion for the number of indecomposable histoires of size $n + 1$:

$$|H'_{n+1}| = (2n+1)!! - \sum_{k=1}^{n} |H'_{n+1-k}|\,|H_k| = (2n+1)!! - \sum_{k=1}^{n} |H'_{n+1-k}|\,(2k-1)!!.$$

By eq. 11, $|H'_{n+1}| = h_n$ and $|H'_{n+1-k}| = h_{n-k}$, so that replacing $k$ by $k + 1$ in the sum gives eq. 6.

The fact that $h_n$ can be computed as in eq. 6 shows that the matching coalescent histories of $\lambda_n$ are equinumerous
with other combinatorial structures. In particular, in addition to being the number of coalescent histories
for lodgepole trees and the number of indecomposable histoires d’Hermite of size $n + 1$, $h_n$ appears in enumerating
topologically distinct Feynman diagrams of order $n$ (Jacobs, 1981; Battaglia and George, 1988), as well as in
counting for an alphabet of size $n + 1$ a class of “irreducible” words in which each letter appears exactly twice,
and in which the first appearances of the letters appear in a canonical order (Burns and Mucha, 2011).


3.4 Asymptotic behavior of $h_n$ and its consequences

We now turn to using our recursion in eq. 6 to determine asymptotic properties of the number $h_n$ of matching
coalescent histories for lodgepole species trees. From eq. 6, it immediately follows for $n \geq 0$ that

$$h_n \leq (2n+1)!!. \qquad (12)$$

Therefore, dividing both sides of eq. 6 by $(2n+1)!!$, for $n \geq 1$, we can write

$$\frac{h_n}{(2n+1)!!} = 1 - \sum_{k=0}^{n-1} \frac{(2k+1)!!}{(2n+1)!!}\, h_{n-1-k} \geq 1 - \sum_{k=0}^{n-1} \frac{(2k+1)!!\,[2(n-1-k)+1]!!}{(2n+1)!!}. \qquad (13)$$


The final step in eq. 13 follows by replacing $h_{n-1-k}$ with the upper bound $[2(n-1-k)+1]!!$ from inequality 12.
Using the fact that

$$(2n+1)!! = \frac{(2n+1)!}{2^n\, n!}, \qquad (14)$$

the sum in eq. 13 can be simplified as

$$s_n = \sum_{k=0}^{n-1} \frac{(2k+1)!!\,[2(n-1-k)+1]!!}{(2n+1)!!} = \sum_{k=0}^{n-1} \frac{\binom{n+1}{k+1}}{\binom{2n+2}{2k+2}}. \qquad (15)$$

For $n = 1$, we have $s_1 = 1/3$. In the Appendix, we show that for $n \geq 1$, the sequence $(s_n)_{n \geq 1}$ satisfies the recursion

$$(2n+3)\,s_{n+1} = (n+2)\,s_n + 1. \qquad (16)$$

For $n \geq 1$, the upper bound

$$s_n \leq \frac{2}{n} \qquad (17)$$

can be verified by induction. Because $s_1 = 1/3$, $s_2 = 2/5$, and $s_3 = 13/35$, inequality 17 holds for $n = 1, 2, 3$. By eq. 16,
the inductive hypothesis yields $(2n+3)\,s_{n+1} \leq 2(n+2)/n + 1 = (3n+4)/n$, so that

$$s_{n+1} \leq \frac{3n+4}{n(2n+3)} \leq \frac{2}{n+1},$$

where the latter inequality holds for $n \geq 3$.
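The quantities in eqs. 15 to 17 can be checked with exact rational arithmetic. This is a small sketch assuming the binomial form of $s_n$ in eq. 15; the function name `s` is chosen here to mirror the notation.

```python
from fractions import Fraction
from math import comb


def s(n):
    """Evaluate s_n from the binomial form in eq. 15, as an exact fraction."""
    return sum(Fraction(comb(n + 1, k + 1), comb(2 * n + 2, 2 * k + 2)) for k in range(n))


if __name__ == "__main__":
    print(s(1), s(2), s(3))
    # Recursion of eq. 16 and bound of eq. 17, checked for a range of n.
    print(all((2 * n + 3) * s(n + 1) == (n + 2) * s(n) + 1 for n in range(1, 40)))
    print(all(s(n) <= Fraction(2, n) for n in range(1, 40)))
```

Such a check does not replace the proof in the Appendix, but it guards against transcription errors in the formulas.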

Therefore, from inequalities 13 and 17, we have for $n \geq 1$

$$1 - \frac{2}{n} \leq 1 - s_n \leq \frac{h_n}{(2n+1)!!} \leq 1,$$

which finally gives the bounds

$$(2n+1)!!\,\frac{n-2}{n} \leq h_n \leq (2n+1)!!,$$

and the asymptotic relationship $h_n \sim (2n+1)!!$. We summarize our results in a proposition.
Proposition 1 The number $h_n$ of matching coalescent histories for the lodgepole family $(\lambda_n)_{n \geq 0}$ satisfies the recursion

$$h_n = (2n+1)!! - \sum_{k=0}^{n-1} (2k+1)!!\, h_{n-1-k}, \qquad (18)$$

where we set $h_0 = 1$. The following bounds hold for $n \geq 1$:

$$(2n+1)!!\,\frac{n-2}{n} \leq h_n \leq (2n+1)!!, \qquad (19)$$

and asymptotically, we have

$$h_n \sim (2n+1)!! \sim \sqrt{2}\left[\frac{2(n+1)}{e}\right]^{n+1}. \qquad (20)$$
The asymptotic approximation in eq. 20 follows from Stirling’s approximation $n! \sim \sqrt{2\pi n}\,(n/e)^n$ and from
an equivalent form of eq. 14, $(2n+1)!! = (2n+2)!/[2^{n+1}(n+1)!]$. By Proposition 1, using an odd number
$m = 2n + 1$ describing the number of taxa in the lodgepole species tree $\lambda_n$, we have the following corollary.

Corollary 1 There exists a family of species trees whose number of matching coalescent histories grows faster
than exponentially in the number of taxa $m$. In particular, when $m$ is odd, the number of matching coalescent
histories for the lodgepole species tree $\lambda_{(m-1)/2}$ with $m \geq 1$ leaves is asymptotically

$$h_{(m-1)/2} \sim m!! \sim \sqrt{2}\left[\frac{m+1}{e}\right]^{(m+1)/2}. \qquad (21)$$
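The Stirling step behind eqs. 20 and 21 can be written out explicitly. The following is a worked sketch in which the shorthand $N = n + 1$ is introduced here only for readability; it uses nothing beyond Stirling's approximation and the form $(2n+1)!! = (2n+2)!/[2^{n+1}(n+1)!]$.

```latex
\begin{align*}
(2n+1)!! &= \frac{(2N)!}{2^{N}\,N!}, \qquad N = n+1, \\
         &\sim \frac{\sqrt{4\pi N}\,(2N/e)^{2N}}{2^{N}\,\sqrt{2\pi N}\,(N/e)^{N}}
          = \sqrt{2}\;\frac{2^{2N} N^{2N} e^{-2N}}{2^{N} N^{N} e^{-N}}
          = \sqrt{2}\left(\frac{2N}{e}\right)^{N} \\
         &= \sqrt{2}\left[\frac{2(n+1)}{e}\right]^{n+1}.
\end{align*}
% With m = 2n + 1, we have 2(n+1) = m + 1 and n + 1 = (m+1)/2, which gives eq. 21:
\[
  m!! \sim \sqrt{2}\left[\frac{m+1}{e}\right]^{(m+1)/2}.
\]
```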


Rosenberg (2007) studied the variability across all species trees for a fixed number of taxa $m$ of the number
of matching coalescent histories by examining a ratio

$$R(m) = \frac{h_m^+}{h_m^-},$$

where $h_m^+$ denotes the number of coalescent histories for the $m$-taxon species tree with the greatest number of
matching coalescent histories and $h_m^-$ denotes the corresponding value for the smallest number of histories.
In Theorem 3.18, Rosenberg (2007) reported a lower bound on $R(m)$ for $m \geq 2$:

$$R(m) \geq \frac{\sqrt{\pi}}{32}\left(\frac{5m-12}{4m-6}\right) m\sqrt{m}. \qquad (22)$$


Our computations with lodgepole species trees substantially increase the lower bound for $h_m^+$. By using
inequality 19, we can improve on the lower bound on $R(m)$ for the case of $m$ odd.

Corollary 2 Let $R(m) = h_m^+/h_m^-$ denote the ratio of the numbers of matching coalescent histories for the
$m$-taxon species trees with the greatest and smallest numbers of coalescent histories. Then, for odd $m \geq 7$,

$$R(m) \geq \left[\frac{\sqrt{m-1}}{4\sqrt{e}}\right]^{m}. \qquad (23)$$


Proof. Because $m$ is odd, we fix $m = 2n + 1$ and switch between indexing by $m$ and by $n$. First, for $h_m^-$, Rosenberg
(2007) considered a “bicaterpillar” tree from whose root descended two caterpillar subtrees, with $\lfloor m/2 \rfloor = n$ and
$\lceil m/2 \rceil = n + 1$ taxa. This tree has $c_n c_{n+1}$ histories (Rosenberg, 2007, Theorem 3.10), where $c_n$ is the Catalan
number as in eq. 1, so that $h_m^- \leq c_n c_{n+1}$.

Now, for $h_m^+$, we use the lodgepole species tree $\lambda_n$ with $m = 2n + 1$ leaves to provide a lower bound on the
number of matching coalescent histories for the tree with the largest number of matching coalescent histories, so
that $h_m^+ \geq h_n$. By inequality 19,

$$h_m^+ \geq h_n \geq (2n+1)!!\left(\frac{n-2}{n}\right).$$

Therefore, with $c_n$ as in eq. 1,

$$R(m) = \frac{h_m^+}{h_m^-} \geq \frac{(2n+1)!!}{c_n c_{n+1}}\left(\frac{n-2}{n}\right) = \frac{(2n+1)(n-2)(n+1)(n+2)\,n!}{2^n\, n \binom{2n+2}{n+1}},$$

where we have again used eq. 14. Using the Stirling bound $n! \geq \sqrt{2\pi n}\,(n/e)^n$ and noting that $\binom{2n+2}{n+1} \leq 4^{n+1}$, we have

$$R(m) \geq \frac{(2n+1)(n-2)(n+1)(n+2)\sqrt{2\pi n}\,(n/e)^n}{2^n\, n\, 4^{n+1}} = \frac{(2n+1)(n-2)(n+1)(n+2)\sqrt{2\pi n}}{4n}\left(\frac{n}{8e}\right)^{n}. \qquad (24)$$

Substituting $n = (m-1)/2$ in inequality 24 to consider the number of taxa $m = 2n + 1$ yields

$$R(m) \geq \frac{\sqrt{e\pi}\; m(m-5)(m+1)(m+3)}{4(m-1)}\left(\frac{\sqrt{m-1}}{4\sqrt{e}}\right)^{m}, \qquad (25)$$

which gives, if $m \geq 7$, inequality 23 (and is in fact stronger than the simpler eq. 23). □
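The chain of inequalities in the proof can be sanity-checked numerically. The sketch below assumes the reconstructed forms of eqs. 23 and 25 and recomputes $h_n$ from the recursion of Proposition 1; the function names are hypothetical. It compares the exact quantity $h_n/(c_n c_{n+1})$, which is itself a lower bound on $R(m)$, with the two closed-form bounds.

```python
from fractions import Fraction
from math import comb, e, pi, sqrt


def odd_double_factorial(n):
    """Return (2n+1)!!."""
    result = 1
    for j in range(1, 2 * n + 2, 2):
        result *= j
    return result


def lodgepole_count(n):
    """Return h_n from the recursion of Proposition 1, with h_0 = 1."""
    h = [1]
    for j in range(1, n + 1):
        h.append(odd_double_factorial(j)
                 - sum(odd_double_factorial(k) * h[j - 1 - k] for k in range(j)))
    return h[n]


def catalan(n):
    return comb(2 * n, n) // (n + 1)


def ratio_bounds(m):
    """Return (h_n / (c_n c_{n+1}), right side of eq. 25, right side of eq. 23)."""
    n = (m - 1) // 2
    exact = Fraction(lodgepole_count(n), catalan(n) * catalan(n + 1))
    base = sqrt(m - 1) / (4 * sqrt(e))
    eq25 = sqrt(e * pi) * m * (m - 5) * (m + 1) * (m + 3) / (4 * (m - 1)) * base ** m
    eq23 = base ** m
    return exact, eq25, eq23


if __name__ == "__main__":
    for m in range(7, 42, 2):
        exact, eq25, eq23 = ratio_bounds(m)
        print(m, float(exact), eq25, eq23, exact >= eq25 >= eq23)
```

The closed-form bounds are evaluated in floating point, so the comparison is meant only for moderate $m$.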


4 Discussion 

We have defined the lodgepole family of species trees $(\lambda_n)_{n \geq 0}$ and studied the growth of the number $h_n$ of matching
coalescent histories for $\lambda_n$ as a function of $n$, showing that asymptotically, $h_n \sim (2n+1)!!$. For $m$ odd, the number
$h_{(m-1)/2}$ of matching coalescent histories for the lodgepole species tree with $m$ taxa grows with $m!!$. Previous
enumerative results for other species tree families have found that the number of coalescent histories increases
only exponentially; we have demonstrated the existence of a family of species trees for which the number of
matching coalescent histories grows more quickly than exponentially in the number of taxa (Corollary 1).

Our results for lodgepole species trees indicate that the exponential increase in the number of matching
coalescent histories observed in Figure 1 is misleading, at least in regard to the largest numbers of matching
coalescent histories at a fixed number of taxa. We can consider the linear regression model obtained in Figure 1,
representing exponential growth, alongside an upper bound for $h_m^-$, the smallest number of matching coalescent
histories at $m$ taxa, and our new lower bound for $h_m^+$, the largest number of matching coalescent histories at
$m$ taxa (Table 1). This comparison illustrates that whereas the linear model is reasonable at the small values
of $m$ depicted in Figure 1, it becomes increasingly unreasonable in predicting $h_m^+$. Indeed, a consequence of
the enumeration for lodgepole families is a substantially larger lower bound for the variability of the number of
matching coalescent histories for species trees of fixed size (Corollary 2).

The lodgepole trees differ from the caterpillars in that pairs of leaves rather than single leaves are descended 
from the internal nodes along the main branch. That the lodgepole species trees have such faster growth 
in their number of matching coalescent histories compared to the caterpillar species trees indicates that this 
apparently minor change in the branching structure of species trees leads to qualitatively different results in 
the number of histories. By contrast, it has been found that certain other changes to the caterpillars, replacing 
a caterpillar subtree by a non-caterpillar subtree, change the asymptotic growth in the number of matching 
coalescent histories only by a change to the constant multiple of the Catalan numbers, and do not change the 
overall growth rate (Rosenberg, 2007, 2013).

The results have the implication that although the numbers of coalescent histories for relatively small species 
trees remain small enough for reasonable computation times involving enumerations of coalescent histories, the 
most challenging cases can grow more rapidly in the number of taxa than has been suggested in the cases that 
have been previously examined. It will be important to determine whether the challenging lodgepole scenario 
arises in practical settings, as well as the possibility that even more challenging families exist, for which the 
growth rate is even faster than in the lodgepole case. 

The links in our analysis to Dyck paths and the histoires d’Hermite, and the appearance for the number of 
matching coalescent histories of lodgepole species trees of a sequence arising in other counting problems, identify 
known combinatorial structures to which coalescent histories can be related. These connections are promising 
for additional future computations about coalescent histories. 
Appendix


This appendix proves eq. 16 from eq. 15. We define for $n \geq 1$ and $0 \leq k \leq n$, $F(k,n) = \binom{n+1}{k+1} / \binom{2n+2}{2k+2}$ and
$R(k,n) = 2n - 2k + 1$. We use $F(k,n)$ and $R(k,n)$ to apply the summation methods of Petkovšek et al. (1996).
It can be verified algebraically that

$$2(n+2)F(k,n) - 2(2n+3)F(k,n+1) = F(k+1,n)R(k+1,n) - F(k,n)R(k,n). \qquad (26)$$

Indeed, the identity follows by dividing both sides of eq. 26 by the nonzero $F(k,n)$ and applying the ratios
$F(k,n+1)/F(k,n) = (2n-2k+1)/(2n+3)$ and $F(k+1,n)/F(k,n) = (2k+3)/(2n-2k-1)$. Summing both
sides of eq. 26 for $k$ from 0 to $n - 1$, the right-hand side telescopes, giving a final contribution of $F(n,n)R(n,n) -
F(0,n)R(0,n)$. Therefore, we obtain

$$2(n+2)\left(\sum_{k=0}^{n-1} F(k,n)\right) - 2(2n+3)\left[\left(\sum_{k=0}^{n} F(k,n+1)\right) - F(n,n+1)\right] = F(n,n)R(n,n) - F(0,n)R(0,n).$$

Taking $s_n = \sum_{k=0}^{n-1} F(k,n)$ as in eq. 15 yields

$$2(n+2)\,s_n - 2(2n+3)\left(s_{n+1} - \frac{1}{2n+3}\right) = 0,$$

from which eq. 16 immediately follows.
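The identity in eq. 26 and the boundary terms used in the telescoping argument can be confirmed with exact fractions. This is a short sketch with function names taken from the notation above; the helper `identity_26_holds` is hypothetical.

```python
from fractions import Fraction
from math import comb


def F(k, n):
    """F(k, n) as defined in the Appendix."""
    return Fraction(comb(n + 1, k + 1), comb(2 * n + 2, 2 * k + 2))


def R(k, n):
    """R(k, n) = 2n - 2k + 1."""
    return 2 * n - 2 * k + 1


def identity_26_holds(k, n):
    """Check eq. 26 for one pair (k, n) with 0 <= k <= n - 1."""
    lhs = 2 * (n + 2) * F(k, n) - 2 * (2 * n + 3) * F(k, n + 1)
    rhs = F(k + 1, n) * R(k + 1, n) - F(k, n) * R(k, n)
    return lhs == rhs


if __name__ == "__main__":
    print(all(identity_26_holds(k, n) for n in range(1, 25) for k in range(n)))
    # Boundary terms: F(n, n) R(n, n) = F(0, n) R(0, n) = 1 and F(n, n+1) = 1/(2n+3).
    print(all(F(n, n) * R(n, n) == 1 and F(0, n) * R(0, n) == 1 for n in range(1, 25)))
    print(all(F(n, n + 1) == Fraction(1, 2 * n + 3) for n in range(1, 25)))
```

The boundary values explain why the telescoped right-hand side vanishes and why the term $1/(2n+3)$ appears in the final display.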